[SPARK-23352][PYTHON][BRANCH-2.3] Explicitly specify supported types in Pandas UDFs #20588
Conversation
This PR explicitly specifies the supported types in Pandas UDFs. The main change is to add deduplicated, explicit type checking on `returnType`, together with documentation; in the process, it happened to fix several things.

1. Currently, we don't support `BinaryType` in Pandas UDFs. For example:

   ```python
   from pyspark.sql.functions import pandas_udf

   pudf = pandas_udf(lambda x: x, "binary")
   df = spark.createDataFrame([[bytearray(1)]])
   df.select(pudf("_1")).show()
   ```

   ```
   ...
   TypeError: Unsupported type in conversion to Arrow: BinaryType
   ```

   We can document this behaviour in the guide.

2. The grouped aggregate Pandas UDF fails fast on `ArrayType`, but it seems we can support this case:

   ```python
   from pyspark.sql.functions import pandas_udf, PandasUDFType

   foo = pandas_udf(lambda v: v.mean(), 'array<double>', PandasUDFType.GROUPED_AGG)
   df = spark.range(100).selectExpr("id", "array(id) as value")
   df.groupBy("id").agg(foo("value")).show()
   ```

   ```
   ...
   NotImplementedError: ArrayType, StructType and MapType are not supported with PandasUDFType.GROUPED_AGG
   ```

3. Since we can check the return type ahead of time, we can fail fast before actual execution:

   ```python
   # we can fail fast at this stage because we know the schema ahead of time
   pandas_udf(lambda x: x, BinaryType())
   ```

Manually tested; unit tests for `BinaryType` and `ArrayType(...)` were added.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes apache#20531 from HyukjinKwon/pudf-cleanup.

(cherry picked from commit c338c8c)
Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
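To illustrate item 3, here is a minimal, library-free sketch of the "fail fast" pattern: validate the declared return type when the UDF is defined rather than at execution time. The type classes, function names, and the error message below are illustrative stand-ins, not PySpark's actual implementation.

```python
# Hypothetical sketch of fail-fast return-type validation.
# These classes stand in for pyspark.sql.types; they are not the real ones.
class BinaryType:
    def __repr__(self):
        return "BinaryType"

class DoubleType:
    def __repr__(self):
        return "DoubleType"

# Types that (in this sketch) Arrow conversion cannot handle.
UNSUPPORTED_RETURN_TYPES = (BinaryType,)

def pandas_udf_sketch(func, return_type):
    """Wrap `func`, but reject unsupported return types immediately,
    before any data is ever processed."""
    if isinstance(return_type, UNSUPPORTED_RETURN_TYPES):
        raise NotImplementedError(
            f"Invalid returnType with Pandas UDFs: "
            f"{return_type!r} is not supported")
    return func

# A supported type passes through; an unsupported one raises at definition time.
f = pandas_udf_sketch(lambda x: x, DoubleType())
```

The benefit is that a misdeclared UDF fails at the `pandas_udf(...)` call site, where the mistake was made, instead of deep inside Arrow serialization during a later action.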
cc @ueshin
LGTM.
Test build #87333 has finished for PR 20588 at commit.
```python
    def test_simple(self):
        from pyspark.sql.functions import pandas_udf, PandasUDFType
        df = self.data
    def test_supported_types(self):
```
I'm starting to worry about the test coverage of vectorized UDFs and Arrow-based to/from pandas DataFrame conversion. Do we have any plan in PySpark to test all the data types?
Yup, agree. I was thinking of doing it. But if you (or your colleagues) are working on that or have a plan, no need to block it on me :). Please go ahead.
Maybe open a JIRA and ask the OSS community to do it?
Yup. Filed SPARK-23401.
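The coverage test discussed above could follow a simple pattern: iterate over (type, sample value) pairs and assert that an identity UDF round-trips each value unchanged. The sketch below uses plain-Python stand-ins for the types and the UDF; a real PySpark test would iterate over Spark's `DataType`s and run the UDF through Arrow conversion, so every name here is illustrative.

```python
# Hypothetical sketch of an exhaustive type-coverage test.
# (type name, sample value) pairs; stand-ins for Spark DataTypes.
SAMPLE_VALUES = [
    ("boolean", True),
    ("int", 42),
    ("double", 3.5),
    ("string", "spark"),
    ("array<double>", [1.0, 2.0, 3.0]),
]

def identity_udf(v):
    # Stand-in for pandas_udf(lambda x: x, <type>) applied to one value.
    return v

def roundtrip_failures(cases):
    """Return the type names whose sample value did not survive the UDF."""
    return [name for name, value in cases if identity_udf(value) != value]
```

Collecting all failures before asserting, rather than stopping at the first, makes it easy to see at a glance which types in the matrix are broken.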
@HyukjinKwon Could you update the PR description? This will be part of the commit, so it would be nice to document the exact changes made in this PR.
Yup, will update soon.
This PR contains multiple fixes, which is not ideal, especially for changes targeting 2.3.0. We should split it into multiple independent PRs if possible. cc @ueshin

Thanks! Merged to 2.3.
## What changes were proposed in this pull request?

This PR backports #20531: it explicitly specifies the supported types in Pandas UDFs. The main change is to add deduplicated, explicit type checking on `returnType`, together with documentation; in the process, it happened to fix several things.

1. Currently, we don't support `BinaryType` in Pandas UDFs. For example:

   ```python
   from pyspark.sql.functions import pandas_udf

   pudf = pandas_udf(lambda x: x, "binary")
   df = spark.createDataFrame([[bytearray(1)]])
   df.select(pudf("_1")).show()
   ```

   ```
   ...
   TypeError: Unsupported type in conversion to Arrow: BinaryType
   ```

   We can document this behaviour in the guide.

2. Since we can check the return type ahead of time, we can fail fast before actual execution:

   ```python
   # we can fail fast at this stage because we know the schema ahead of time
   pandas_udf(lambda x: x, BinaryType())
   ```

## How was this patch tested?

Manually tested; unit tests for `BinaryType` and `ArrayType(...)` were added.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #20588 from HyukjinKwon/PR_TOOL_PICK_PR_20531_BRANCH-2.3.
Could you close it?
To be more clear: this PR contained one targeted change that fixes multiple problems.